Dimension Reduction

A Quick Tour with Examples

USM Data Science

2026-02-04

Class resources


What is Dimension Reduction?

Consider a typical machine learning situation

  • one response variable (or none - unsupervised)
  • one or more predictors (features, variables)

penguins data.

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 38.7 19.0 195 3450 female 2007
Chinstrap Dream 46.6 17.8 193 3800 female 2007
Gentoo Biscoe 49.5 16.2 229 5800 male 2008
Adelie Torgersen 35.1 19.4 193 4200 male 2008
Adelie Biscoe 42.7 18.3 196 4075 male 2009
Adelie Dream 36.3 19.5 190 3800 male 2008

We are interested in the dimension of the predictor space.

Think linear algebra

Dimension of the column space

An Example

Sakar et al. (2019)

  • 252 patients, 188 of whom had a previous diagnosis of Parkinson’s disease and 64 of whom did not.
  • Speech signal processing techniques yield 750 numerical predictors, or features.


Goal: classify Parkinson’s status.


Can we fit a logistic regression model?

The parkinsons design matrix

  • 252 rows and 750 + 1 = 751 columns (one for the intercept)

  • More columns than rows!

  • So the data matrix \(X\) is not of full column rank - a problem for some ML models

  • \(p \gg n\) is common in many ML applications (e.g., genomics, image analysis, text mining)
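To see the rank deficiency concretely, here is a small numpy sketch with the same shape as the parkinsons design matrix (random data standing in for the real features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the parkinsons design matrix:
# 252 rows (patients), 750 features plus an intercept column = 751 columns
n, p = 252, 751
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# The rank is at most min(n, p) = 252, so X cannot be of full column rank
rank = np.linalg.matrix_rank(X)
print(X.shape, rank)  # (252, 751) 252
```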

Feature selection - one form of dimension reduction

Even if \(X\) is full rank, we might want to eliminate some predictors to decrease model variance.

  • near-zero variance predictors

  • highly correlated predictors (with each other)

  • predictors uncorrelated with the response

Many methods…
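Two of these filters can be sketched quickly with synthetic data, using sklearn's VarianceThreshold for the first and a hand-rolled correlation check for the second:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)                    # informative predictor
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly a duplicate of x1
x3 = np.full(n, 5.0)                       # zero-variance predictor
X = np.column_stack([x1, x2, x3])

# Filter 1: drop (near-)zero-variance predictors
X_kept = VarianceThreshold(threshold=1e-6).fit_transform(X)
print(X_kept.shape)  # (200, 2) -- x3 removed

# Filter 2: flag highly correlated predictor pairs
corr = np.corrcoef(X_kept, rowvar=False)
print(corr[0, 1] > 0.99)  # True -- x1 and x2 are nearly collinear
```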

Another approach: Feature engineering

More generally, can we transform and combine predictors in various ways to form a smaller set of new features?

The goal of dimension reduction

Why do dimension reduction?

  • Reduce model variance - mitigate multicollinearity
  • Reduce model complexity
  • Improve model interpretability
  • Reduce computational cost
  • Mitigate the “curse of dimensionality”

Much of this is model-dependent

Also used for

  • Visualize high-dimensional data
  • data compression
  • autoencoders - representation learning

Curse of dimensionality

Curse of dimensionality

Suppose KNN with \(p\) predictors \(\sim U(0,1)\)

  • volume of hypercube with sides of length \(d\) is \(d^p\).

  • To capture \(r\) proportion of the data, we want \(d^p = r\), so \(d = r^{1/p}\)

We are using the \(L_{\infty}\) norm (max norm) to define the neighborhood here, but the argument is similar for other norms.
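Plugging numbers into \(d = r^{1/p}\) shows how quickly the trouble sets in: to capture just 10% of uniform data, the side length of the hypercube neighborhood approaches 1 as \(p\) grows.

```python
# Side length d = r**(1/p) needed to capture a fraction r of the data
r = 0.1
d = {p: r ** (1 / p) for p in (1, 2, 10, 100)}
for p, side in d.items():
    print(p, round(side, 3))  # 1 -> 0.1, 2 -> 0.316, 10 -> 0.794, 100 -> 0.977
```

In 100 dimensions the "neighborhood" spans almost the entire range of every predictor, so nearby points are no longer meaningfully local.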

Feature selection/engineering/extraction and Dimension reduction

Feature “selection”

  • select a subset of existing predictors or features

Feature engineering

  • create new predictors or features from existing ones: transformations, combinations, interactions, polynomials, etc.

Feature extraction

  • create a lower-dimensional representation that is structurally similar to the original predictors

Principal Components Analysis (PCA)

A Canonical Example

Principal Components Analysis (PCA)

Create new features - components - from linear combinations of the predictors

  • The first component is in the direction of maximum variance of the data*.

  • The second component is in the direction of maximum variance orthogonal to (and thus uncorrelated with) the first, and so on.

*It is thus important to scale the predictors first if they are on different scales.

PCA is unsupervised!

PCA - the math

Predictors \(X_1, X_2, \dots, X_p\); form linear combinations

\[ Z_m = \sum_{j=1}^p \phi_{jm} X_j, \qquad m = 1, 2, \dots, p \]

In matrix form,

\[ Z = X \mathbf{U} \]

Now with data

\[ \mathbf{Z} = \mathbf{XU} \]

Principal components and eigenvectors

First scale or standardize the data matrix.

The principal component directions - the columns of \(\mathbf{U}\) - are the eigenvectors of \(\mathbf{X}^T \mathbf{X}\).

(Equivalently, the eigenvectors of the correlation matrix - standardizing is automatic.)

Theorem

  • Eigenvectors of \(\mathbf{X}^T \mathbf{X}\) are orthogonal

  • Let \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0 \] with eigenvectors \(\mathbf{v}_1, \dots, \mathbf{v}_p\)

    • \(\lambda_i\) is variance in direction \(\mathbf{v}_i\)
    • \(\mathbf{v}_1\) maximizes variance
    • \(\mathbf{v}_2\) maximizes variance subject to orthogonality with \(\mathbf{v}_1\)
    • and so on
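The theorem is easy to check numerically. A sketch using sklearn's copy of the wine data as a stand-in (any numeric matrix would do): the eigenvectors of the correlation matrix match the PCA directions up to sign, and the eigenvalues match the component variances.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_wine().data)
n = X.shape[0]

# Eigendecomposition of the correlation matrix X^T X / n (X standardized)
evals, evecs = np.linalg.eigh(X.T @ X / n)
order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
evals, evecs = evals[order], evecs[:, order]

pca = PCA().fit(X)

# Eigenvalues equal the PCA component variances (up to n vs n-1 scaling)
print(np.allclose(evals, pca.explained_variance_ * (n - 1) / n))  # True

# First eigenvector equals the first PC direction up to sign
v1, u1 = evecs[:, 0], pca.components_[0]
print(np.allclose(np.abs(v1 @ u1), 1.0))  # True
```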

Principal components regression (PCR)

  1. Center and scale the data

  2. Form \(\mathbf{U}\), the eigenvectors of \(\mathbf{X}^T \mathbf{X}\) (with \(\mathbf{X}\) centered and scaled), ordered by decreasing eigenvalue

  3. Compute \(\mathbf{Z} = \mathbf{XU}\)

  4. Select the number of components, \(m\), based on explained variance (or cross-validation).

  5. Fit a linear regression using the first \(m\) columns of \(\mathbf{Z}\) as predictors.

For \(m = p\), PCR equals ordinary least squares

In practice we will typically use software libraries to find the components
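The five steps above map directly onto an sklearn pipeline. A sketch using sklearn's diabetes data (a stand-in, not a dataset from these slides), which also verifies the \(m = p\) claim:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
p = X.shape[1]

def pcr(m):
    # steps 1-3 and 5: scale -> project onto first m components -> regress
    return make_pipeline(StandardScaler(), PCA(n_components=m), LinearRegression())

# With m = p, PCR reproduces ordinary least squares
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
same = np.allclose(pcr(p).fit(X, y).predict(X), ols.predict(X))
print(same)  # True

# A smaller m gives a simpler, lower-variance model
r2 = pcr(3).fit(X, y).score(X, y)
print(round(r2, 3))
```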

Example: wine data (UCI ML Repository)

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.


Class_label Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280_OD315_of_diluted_wines Proline
1 14.2 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.2 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
1 13.2 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
1 14.4 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
1 13.2 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
1 14.2 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450

Multinomial regression on the PCA features

We get a perfect score with just 5 components, and 98% with only 3.


num_pca_components accuracy
1 0.848315
2 0.960674
3 0.983146
4 0.983146
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
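The sweep behind this table can be reproduced roughly with sklearn's copy of the UCI wine data; exact accuracies may differ a little from the table above, since sklearn's LogisticRegression regularizes by default.

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

# Training accuracy of multinomial regression on the first m PCA components
accs = {}
for m in (1, 3, 5, 13):
    clf = make_pipeline(
        StandardScaler(),
        PCA(n_components=m),
        LogisticRegression(max_iter=5000),
    )
    accs[m] = clf.fit(X, y).score(X, y)
    print(m, round(accs[m], 3))
```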

PCA as a preprocessing step


  • can be applied to a subset of the predictors - not all or nothing.

  • preprocessing/feature engineering step for any ML model

  • Use tidymodels::recipes, sklearn PCA, prcomp

Dimension reduction for visualization, understanding global structure, variable relationships

PCA on the USArrests dataset


Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7

Loadings and scores


PCA loadings; the \(\mathbf{U}\) values

PC1 PC2 PC3 PC4
Murder -0.536 -0.418 0.341 0.649
Assault -0.583 -0.188 0.268 -0.743
UrbanPop -0.278 0.873 0.378 0.134
Rape -0.543 0.167 -0.818 0.089


Cumulative normalized variances

[1] 0.620060 0.867502 0.956642 1.000000


PCA scores, \(\mathbf{Z} = \mathbf{XU}\)

PC1 PC2 PC3 PC4
Alabama -0.975660 -1.122001 0.439804 0.154697
Alaska -1.930538 -1.062427 -2.019500 -0.434175
Arizona -1.745443 0.738460 -0.054230 -0.826264
Arkansas 0.139999 -1.108542 -0.113422 -0.180974
California -2.498613 1.527427 -0.592541 -0.338559
Colorado -1.499341 0.977630 -1.084002 0.001450
Connecticut 1.344992 1.077984 0.636793 -0.117279
Delaware -0.047230 0.322089 0.711410 -0.873113
Florida -2.982760 -0.038834 0.571032 -0.095317
Georgia -1.622807 -1.266088 0.339018 1.065974

Visualizing the loadings

Biplots

We can visualize the loadings and the scores together with a biplot.

Other matrix factorizations for feature extraction

  • Non-negative Matrix Factorization (NNMF)

  • Independent Component Analysis (ICA)

Non-linear dimension reduction

Uniform manifold approximation and projection (UMAP)



Constructs a graph representation of the data in high dimensions, then optimizes a low-dimensional graph to be as structurally similar as possible.

Uniform manifold approximation and projection (UMAP)

penguins revisited


We use just the four numerical predictors

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 38.7 19.0 195 3450 female 2007
Chinstrap Dream 46.6 17.8 193 3800 female 2007
Gentoo Biscoe 49.5 16.2 229 5800 male 2008
Adelie Torgersen 35.1 19.4 193 4200 male 2008
Adelie Biscoe 42.7 18.3 196 4075 male 2009
Adelie Dream 36.3 19.5 190 3800 male 2008


PCA vs UMAP on penguins


UMAP may be overkill here

UMAP

mnist dataset - images of handwritten digits (0-9). 784 features (28 x 28 pixels)


mnist samples

PCA vs UMAP on mnist

mnist dataset - images of handwritten digits (0-9). 784 features (28 x 28 pixels)


UMAP main hyperparameters


  • n_neighbors - number of neighboring points used in local approximations.

  • min_dist - minimum distance between points in low-dimensional space

These control the balance between local and global structure in the data.

More on UMAP

Comparing PCA, UMAP, and t-SNE

Source: Biostatsquid

Neural Networks

Neural Networks

neural network diagram

Source: Gerhard Paab blog

autoencoder diagram. Source: S. Raschka, Deep Learning

Coding notes

  • prcomp in base R; step_pca() in the recipes package (part of tidymodels)

  • sklearn.decomposition.PCA

  • umap package in R; also usable as a recipes step

  • umap-learn is not part of sklearn itself, but is part of the scikit-learn ecosystem


See the source code for these slides, along with the accompanying R and ipynb files, for example code.

Recap

  • Why dimension reduction matters

    • High-dimensional predictors and rank deficiency
    • Variance, multicollinearity, interpretability, computational cost
    • The curse of dimensionality (analytical + simulation intuition)
  • Feature selection, engineering, extraction

  • Principal Components Analysis (PCA)

    • Mathematics; eigenvectors and eigenvalues
    • Principal components regression (PCR)
    • PCA loadings and scores
  • Nonlinear dimension reduction

    • UMAP, neural networks